For the past few weeks, the Camp Fire in Northern California has been a deadly and destructive force, receiving enormous media coverage. The fire, which began on November 8, 2018, has left 85 people dead and 249 missing, and has destroyed thousands of buildings. It has caused major air quality issues, leading many to wear masks or stay indoors to protect their health. But this is not the first time a major wildfire has ravaged the west coast of the United States.
In this tutorial, we will analyze west coast wildfire data from 2010 to 2016, available through the University of California Irvine Data Science Initiative GitHub organization. This data was obtained from the NASA Moderate Resolution Imaging Spectroradiometer (MODIS) satellite. We will merge this wildfire data with daily weather data obtained from the NOAA Climate Data Online tool.
We will begin by loading the data and performing exploratory data analysis before diving deeper into regression analysis and classification analysis using basic machine learning methods. These data science methods will allow us to look at past trends and make inferences about future west coast wildfires.

We will begin by importing Python packages that will help us to load and wrangle the data:
- pandas: a package that allows us to build data frames, a tabular data format that can be thought of like a spreadsheet
- numpy: a package containing matrix and vector mathematical functions
import pandas as pd
import numpy as np
The wildfire dataset is contained in a comma-separated values (CSV) file located on GitHub. Pandas has a read_csv function that allows us to load a data frame from a CSV over the web. We will specify the data types of each column using a Python dict to ensure that pandas loads the data using the data types that we will want to use in our analysis.
data_url = 'https://raw.githubusercontent.com/UCIDataScienceInitiative/Climate_Hackathon/master/west_coast_fires/fires.csv'
dtype = {
    'confidence': float,
    'day': int,
    'frp': float,
    'lat': float,
    'lon': float,
    'month': int,
    'year': int,
    'x': float,
    'y': float,
    'dayofyear': int,
    'vpd': float,
    'temp': float,
    'humidity': float
}
df = pd.read_csv(data_url, dtype=dtype)
df.head()
We can use the shape attribute to learn how many fire observations are contained in this dataset:
df.shape
The next step is to determine what this wildfire data means so that we can perform exploration. Luckily, the README for the GitHub repository provides us with a description of each column:
- id: unique identifier of the fire in the overall context of the world dataset
- confidence: how much confidence the satellite has that this is actually a fire detection (percent)
- day: the day of the month
- frp: Fire Radiative Power, the strength of the fire
- lat: latitude
- lon: longitude
- month: month of the year
- year: year
- x: x position in a uniformly-spaced grid
- y: y position in a uniformly-spaced grid
- dayofyear: day of the year (from 0 to 364, or 365 for leap years)
- vpd: Vapor Pressure Deficit, the difference between the moisture in the air and the amount of moisture the air could hold
- temp: temperature (degrees Kelvin)
- humidity: humidity (percent)

To be extra cautious, we can restrict our analysis to those fires with high confidence.
confidence_threshold = 0.8
df = df.loc[df['confidence'] > confidence_threshold]
We can observe the size of our data after filtering by confidence threshold.
df.shape
We will also want to exclude fires with a strength of zero.
frp_threshold = 1
df = df.loc[df['frp'] > frp_threshold]
We can again observe the size of our data, this time after filtering by frp threshold.
df.shape
The wildfire dataset README notes that the same fire may be split into separate observations if it spreads over multiple locations or lasts multiple days. We can try to remedy this by consolidating fire observations that occur on consecutive days and lie within a plausible spread distance of each other.
According to the Wildfire page on Wikipedia, forest fires can spread as fast as 10.8 km per hour. At that rate, a fire could spread roughly 259 kilometers in a single day:
wildfire_max_daily_spread = 10.8*24
wildfire_max_daily_spread
To easily identify consecutive days, we will first add a column called dayofdataset, which is similar to dayofyear but does not restart at the end of the year.
min_year = df['year'].min()
max_year = df['year'].max()
df['dayofdataset'] = df.apply(lambda row: (row['year'] - min_year) * 365 + row['dayofyear'], axis='columns')
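As a quick sanity check of this formula (using hypothetical values, and noting that it treats every year as exactly 365 days, so leap days introduce a small drift):

```python
min_year = 2010  # hypothetical minimum year, for illustration

def to_dayofdataset(year, dayofyear):
    # Same arithmetic as the apply() call above: whole years elapsed
    # times 365, plus the day within the current year.
    return (year - min_year) * 365 + dayofyear

# Day 100 of the first year is simply day 100 of the dataset,
# and day 100 of 2012 falls two 365-day "years" later: 830.
print(to_dayofdataset(2010, 100))
print(to_dayofdataset(2012, 100))
```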
Next we will take the difference of dayofdataset for each row with the previous row. The resulting groups of rows that are 1 day or 0 days apart will be the candidates for our consolidation procedure.
df['dayofdataset_diff'] = df['dayofdataset'].diff()
df.head()
Based on this date difference column, we can give each group of consecutive dates a different index, so that we can use groupby to iterate over each group of dates.
df = df.reset_index(drop=True)
df['date_group_index'] = np.nan
curr_date_group_index = 1
for index, row in df.iterrows():
    if pd.isna(row['dayofdataset_diff']) or row['dayofdataset_diff'] <= 1.0:
        df.at[index, 'date_group_index'] = curr_date_group_index
    else:
        # a gap of more than one day starts a new group,
        # and this row is the first member of that new group
        curr_date_group_index += 1
        df.at[index, 'date_group_index'] = curr_date_group_index
df.head(10)
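As an aside, this kind of run indexing can also be done without an explicit loop: pandas can flag every gap larger than one day and cumulative-sum the flags, giving each run of consecutive days its own index. A minimal sketch on a toy day-difference column:

```python
import numpy as np
import pandas as pd

# Toy day-difference column (NaN for the first row), for illustration.
diffs = pd.Series([np.nan, 0.0, 1.0, 3.0, 0.0, 2.0])

# Each gap of more than one day starts a new group; cumulative-summing
# the gap flags assigns a distinct index to every run of consecutive days.
group_index = (diffs > 1.0).cumsum() + 1
print(group_index.tolist())  # [1, 1, 1, 2, 2, 3]
```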
Here, we will iterate through each "date group" and identify each "fire group" contained within, based on computation of pairwise fire distances. We will use Python's itertools to determine the fire pairs in a group.
import itertools
We can use the Haversine Formula to compute distances between pairs of fires using their latitude and longitude values.
# Adapted from this StackOverflow post: https://stackoverflow.com/a/4913653
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # radius of earth in kilometers
    return c * r
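As a quick sanity check (the function is repeated below so the snippet runs on its own, and the city coordinates are approximate), the great-circle distance from Los Angeles to San Francisco should come out near 560 kilometers:

```python
import numpy as np

def haversine(lon1, lat1, lon2, lat2):
    # Same formula as above, reproduced so this check is self-contained.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    return 2 * np.arcsin(np.sqrt(a)) * 6371  # earth radius in km

# Los Angeles (34.05 N, 118.24 W) to San Francisco (37.77 N, 122.42 W)
d = haversine(-118.2437, 34.0522, -122.4194, 37.7749)
print(round(d))  # ~559 km as the crow flies
```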
Fires that are within 259 kilometers of each other per day of separation will be placed into the same "fire group" and given the same fire_group_index value.
Unfortunately, even after splitting by "date group", the number of pairs we need to compute distances between is still too large. To keep the computation tractable, we will set a maximum group size and treat the observations as separate fires if the number of elements in a "date group" is too large.
max_date_group_size = 2000
Now we will iterate over every date group and compute distances between each pair of fire observations, assigning to a new fire group index column if multiple observations should be considered the same fire.
curr_max_fire_group_index = df.shape[0]
df['fire_group_index'] = np.arange(curr_max_fire_group_index)
curr_max_fire_group_index += 1
for date_group_index, date_group_df in df.groupby(['date_group_index']):
    if pd.notnull(date_group_index) and date_group_df.shape[0] < max_date_group_size:
        fire_pairs = itertools.combinations(list(date_group_df.index.values), 2)
        group_pairs = {}
        for index_a, index_b in fire_pairs:
            dist = haversine(
                date_group_df.loc[index_a]['lon'],
                date_group_df.loc[index_a]['lat'],
                date_group_df.loc[index_b]['lon'],
                date_group_df.loc[index_b]['lat']
            )
            if dist <= wildfire_max_daily_spread*(1+np.absolute(date_group_df.loc[index_a]['dayofdataset'] - date_group_df.loc[index_b]['dayofdataset'])):
                if index_a in group_pairs.keys():
                    df.at[index_b, 'fire_group_index'] = group_pairs[index_a]
                elif index_b in group_pairs.keys():
                    df.at[index_a, 'fire_group_index'] = group_pairs[index_b]
                else:
                    curr_max_fire_group_index += 1
                    group_pairs[index_a] = curr_max_fire_group_index
                    group_pairs[index_b] = curr_max_fire_group_index
                    df.at[index_a, 'fire_group_index'] = curr_max_fire_group_index
                    df.at[index_b, 'fire_group_index'] = curr_max_fire_group_index
df.head()
Using the NOAA Climate Data Online (CDO) service, we can obtain daily climate data from 2010 to 2016 from weather stations on the west coast. The CDO search tool allows us to select NOAA stations of interest and the types of data we want to download for a specified time period. In this case, I have manually selected 75 weather stations spread throughout the west coast region.
Along with the location (latitude and longitude) of each weather station, I chose to download data comprising the following variables:

Air Temperature
- TAVG: Average temperature
- TMAX: Maximum temperature
- TMIN: Minimum temperature

Precipitation
- MDPR: Multiday precipitation total (use with DAPR and DWPR, if available)
- DAPR: Number of days included in the multiday precipitation total (MDPR)
- PRCP: Precipitation
- SNWD: Snow depth
- SNOW: Snowfall

Sunshine
- PSUN: Daily percent of possible sunshine for the period
- TSUN: Total sunshine for the period

Wind
- AWND: Average wind speed
- WDF2: Direction of fastest 2-minute wind
- WDF5: Direction of fastest 5-second wind
- WSF2: Fastest 2-minute wind speed
- WSF5: Fastest 5-second wind speed
- PGTM: Peak gust time
- FMTM: Time of fastest mile or fastest 1-minute wind

Weather Type: binary indication of specific weather events
- WT01: Fog, ice fog, or freezing fog (may include heavy fog)
- WT02: Heavy fog or heavy freezing fog (not always distinguished from fog)
- WT03: Thunder
- WT04: Ice pellets, sleet, snow pellets, or small hail
- WT05: Hail (may include small hail)
- WT06: Glaze or rime
- WT07: Dust, volcanic ash, blowing dust, blowing sand, or blowing obstruction
- WT08: Smoke or haze
- WT09: Blowing or drifting snow
- WT10: Tornado, waterspout, or funnel cloud
- WT11: High or damaging winds
- WT13: Mist
- WT14: Drizzle
- WT15: Freezing drizzle
- WT16: Rain (may include freezing rain, drizzle, and freezing drizzle)
- WT17: Freezing rain
- WT18: Snow, snow pellets, snow grains, or ice crystals
- WT19: Unknown source of precipitation
- WT21: Ground fog
- WT22: Ice fog or freezing fog

After selecting data, the CDO tool sends an email with a link to download the requested data. I have done this and placed the downloaded file in this GitHub repository so that we can easily load it with pandas.
cdo_df = pd.read_csv('data/1572799.csv')
cdo_df.head()
And taking note of the size of this data frame:
cdo_df.shape
To be able to match this weather data with our wildfire data, we will look at the weather from the closest station to each fire observation.
We will first need to map each station to its location.
cdo_loc_df = cdo_df.drop_duplicates(subset=['STATION'])[['STATION', 'LATITUDE', 'LONGITUDE']].set_index('STATION', drop=True)
cdo_loc_df.head()
Using the haversine formula from above, we can write a function that returns the ID of the closest weather station and the distance (in kilometers), given the coordinates of a fire:
def get_closest_station(fire_lon, fire_lat):
    station_distance_series = cdo_loc_df.apply(lambda row: haversine(row['LONGITUDE'], row['LATITUDE'], fire_lon, fire_lat), axis='columns')
    return (station_distance_series.idxmin(), station_distance_series.min())
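To see what this returns before running it on the full dataset, here is a self-contained run on a toy station table; the station IDs and coordinates are made up for illustration:

```python
import numpy as np
import pandas as pd

def haversine(lon1, lat1, lon2, lat2):
    # Great-circle distance in km, as defined earlier in the tutorial.
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    return 2 * np.arcsin(np.sqrt(a)) * 6371

# A toy station table standing in for cdo_loc_df (made-up IDs/coordinates).
cdo_loc_df = pd.DataFrame({
    'LATITUDE': [45.5, 34.0],
    'LONGITUDE': [-122.7, -118.2],
}, index=pd.Index(['STATION_A', 'STATION_B'], name='STATION'))

def get_closest_station(fire_lon, fire_lat):
    station_distance_series = cdo_loc_df.apply(
        lambda row: haversine(row['LONGITUDE'], row['LATITUDE'], fire_lon, fire_lat),
        axis='columns')
    return (station_distance_series.idxmin(), station_distance_series.min())

# A fire near the southern station should match STATION_B.
station, dist = get_closest_station(-118.3, 34.1)
print(station, round(dist, 1))
```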
Now we can apply this function along the entire wildfire dataframe, creating two new columns:
df['closest_station'], df['closest_station_dist'] = zip(*df.apply(lambda row: get_closest_station(row['lon'], row['lat']), axis='columns'))
df.head()
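The zip(*...) idiom in that line deserves a note: apply returns a Series of (station, distance) tuples, and zip with argument unpacking transposes them into two parallel sequences, one per new column. A minimal illustration with made-up values:

```python
# Each tuple pairs a station ID with a distance, mimicking apply()'s output.
pairs = [('STATION_A', 12.5), ('STATION_B', 3.1), ('STATION_A', 47.9)]

# zip(*pairs) transposes the list of tuples into one tuple per "column".
stations, dists = zip(*pairs)
print(stations)  # ('STATION_A', 'STATION_B', 'STATION_A')
print(dists)     # (12.5, 3.1, 47.9)
```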
The mean distance from fires to the closest weather station is approximately 79 kilometers, or 49 miles:
df['closest_station_dist'].mean()
Turning our attention to the weather dataframe now, we will convert its date values into a dayofyear column so that we can easily cross-reference from the wildfire dataset to determine the weather on the day of a fire.
We will need to import Python's datetime module to help with date formatting.
import datetime
def cdo_date_to_dayofyear(cdo_date):
    date = datetime.datetime.fromisoformat(cdo_date).date()
    # '%j' is zero-padded, which int() handles; the '%-j' flag is not portable
    return int(date.strftime('%j'))

def cdo_date_to_year(cdo_date):
    date = datetime.datetime.fromisoformat(cdo_date).date()
    return int(date.strftime('%Y'))
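We can sanity-check the day-of-year conversion on a couple of hand-picked dates; 2012 is a leap year, so March 1 is day 61 rather than day 60. (The self-contained snippet below uses timetuple().tm_yday, an equivalent way to get the day of year.)

```python
import datetime

def cdo_date_to_dayofyear(cdo_date):
    # tm_yday gives the 1-based day of the year directly.
    return datetime.datetime.fromisoformat(cdo_date).timetuple().tm_yday

print(cdo_date_to_dayofyear('2011-03-01'))  # 60: 31 + 28 + 1
print(cdo_date_to_dayofyear('2012-03-01'))  # 61: 31 + 29 + 1
```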
Now we can apply this conversion function to all of the date values in the CDO weather dataset:
cdo_df['dayofyear'] = cdo_df['DATE'].apply(cdo_date_to_dayofyear)
cdo_df['year'] = cdo_df['DATE'].apply(cdo_date_to_year)
cdo_df['dayofdataset'] = cdo_df.apply(lambda row: (row['year'] - min_year) * 365 + row['dayofyear'], axis='columns')
Next, we can compute periods of drought.
We can perform a similar process as we did before when computing groups of dates for the wildfires, and assign unique indices to groups of consecutive observations (from the same weather station) with zero values for the PRCP precipitation column.
cdo_df['drought_group_index'] = np.nan
curr_drought_index = 1
for station_name, station_df in cdo_df.groupby(['STATION']):
    for index, row in station_df.iterrows():
        if pd.isna(row['PRCP']) or row['PRCP'] == 0.0:
            cdo_df.at[index, 'drought_group_index'] = curr_drought_index
        else:
            curr_drought_index += 1
cdo_df.head()
We can create a new dataframe containing the start day, end day, and length of each drought period:
drought_df = pd.DataFrame(data=[], index=[], columns=['STATION', 'drought_group_index', 'dayofdataset_start', 'dayofdataset_end', 'length'])
for drought_group_index, drought_group_df in cdo_df.groupby(['drought_group_index']):
    if pd.notnull(drought_group_index):
        drought_group_df_index_vals = list(drought_group_df.index.values)
        drought_group_start_day = drought_group_df.loc[drought_group_df_index_vals[0]]['dayofdataset']
        drought_group_end_day = drought_group_df.loc[drought_group_df_index_vals[-1]]['dayofdataset']
        drought_group_length = (drought_group_end_day - drought_group_start_day)
        drought_group_station = drought_group_df.loc[drought_group_df_index_vals[0]]['STATION']
        if drought_group_length > 0:
            drought_df = drought_df.append({
                'STATION': drought_group_station,
                'drought_group_index': drought_group_index,
                'dayofdataset_start': drought_group_start_day,
                'dayofdataset_end': drought_group_end_day,
                'length': drought_group_length
            }, ignore_index=True)
drought_df.head()
Finally, we are ready to merge our drought data with our fire data. We will add a new column to indicate how long a drought (if any) preceded the fire.
def get_drought_length(row):
    fire_drought_df = drought_df.loc[
        (drought_df['STATION'] == row['closest_station']) &
        (drought_df['dayofdataset_start'] <= row['dayofdataset']) &
        (drought_df['dayofdataset_end'] >= row['dayofdataset'])
    ]
    if fire_drought_df.shape[0] == 1:
        return (row['dayofdataset'] - fire_drought_df.reset_index(drop=True).loc[0]['dayofdataset_start'])
    else:
        return 0
df['drought_length'] = df.apply(get_drought_length, axis='columns')
df.head()
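To make the lookup logic concrete, here is the same function run against a toy drought table; the station ID and day values are made up:

```python
import pandas as pd

# A toy drought table standing in for drought_df (values are made up).
drought_df = pd.DataFrame({
    'STATION': ['STATION_A'],
    'dayofdataset_start': [100],
    'dayofdataset_end': [140],
})

def get_drought_length(row):
    # Select the (at most one) drought interval at this fire's closest
    # station that contains the fire's day of the dataset.
    fire_drought_df = drought_df.loc[
        (drought_df['STATION'] == row['closest_station']) &
        (drought_df['dayofdataset_start'] <= row['dayofdataset']) &
        (drought_df['dayofdataset_end'] >= row['dayofdataset'])
    ]
    if fire_drought_df.shape[0] == 1:
        return row['dayofdataset'] - fire_drought_df.reset_index(drop=True).loc[0]['dayofdataset_start']
    else:
        return 0

# A fire on day 130 falls inside the 100-140 drought, so 30 dry days
# preceded it; a fire on day 150 falls outside and yields 0.
fire_in = {'closest_station': 'STATION_A', 'dayofdataset': 130}
fire_out = {'closest_station': 'STATION_A', 'dayofdataset': 150}
print(get_drought_length(fire_in))   # 30
print(get_drought_length(fire_out))  # 0
```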
Using the "fire groups" we can create a new dataset containing one observation for each group:
grouped_df_columns = [
    'fire_group_index',
    'fire_group_size',
    'day_start',
    'day_end',
    'frp_mean',
    'frp_sum',
    'lat_min',
    'lat_max',
    'lon_min',
    'lon_max',
    'month_start',
    'month_end',
    'year_start',
    'year_end',
    'dayofyear_start',
    'dayofyear_end',
    'dayofdataset_start',
    'dayofdataset_end',
    'x_min',
    'x_max',
    'y_min',
    'y_max',
    'vpd_mean',
    'temp_mean',
    'humidity_mean',
    'drought_length'
]
grouped_df = pd.DataFrame(data=[], index=[], columns=grouped_df_columns)
for fire_group_index, fire_group_df in df.groupby(['fire_group_index']):
    fire_group_df = fire_group_df.reset_index(drop=True)
    start_row = fire_group_df.loc[0]
    end_row = fire_group_df.loc[fire_group_df.shape[0]-1]
    grouped_df = grouped_df.append({
        'fire_group_index': fire_group_index,
        'fire_group_size': fire_group_df.shape[0],
        'day_start': start_row['day'],
        'day_end': end_row['day'],
        'frp_mean': fire_group_df['frp'].mean(),
        'frp_sum': fire_group_df['frp'].sum(),
        'lat_min': fire_group_df['lat'].min(),
        'lat_max': fire_group_df['lat'].max(),
        'lon_min': fire_group_df['lon'].min(),
        'lon_max': fire_group_df['lon'].max(),
        'month_start': start_row['month'],
        'month_end': end_row['month'],
        'year_start': start_row['year'],
        'year_end': end_row['year'],
        'dayofyear_start': start_row['dayofyear'],
        'dayofyear_end': end_row['dayofyear'],
        'dayofdataset_start': start_row['dayofdataset'],
        'dayofdataset_end': end_row['dayofdataset'],
        'x_min': fire_group_df['x'].min(),
        'x_max': fire_group_df['x'].max(),
        'y_min': fire_group_df['y'].min(),
        'y_max': fire_group_df['y'].max(),
        'vpd_mean': fire_group_df['vpd'].mean(),
        'temp_mean': fire_group_df['temp'].mean(),
        'humidity_mean': fire_group_df['humidity'].mean(),
        'drought_length': start_row['drought_length'],
    }, ignore_index=True)
grouped_df.head()
We can look at the results of our grouping procedure by looking at the sizes of groups and how many groups consist of only one observation vs. how many consist of multiple observations:
# Maximum group size
grouped_df['fire_group_size'].max()
# Number of groups with one observation
grouped_df.loc[grouped_df['fire_group_size'] == 1].shape[0]
# Number of groups with multiple observations
grouped_df.loc[grouped_df['fire_group_size'] > 1].shape[0]
Perhaps this number of groups with multiple observations would have been greater if we had not limited the pairwise comparison procedure and had additional computational power.
To begin our exploration, we can look at the distributions of some of our variables.
To visualize data, we can load helpful python visualization packages:
- matplotlib
- seaborn
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib import cm
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
import seaborn as sns
# matplotlib notebook configuration options
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
First we will look at the distribution of the confidence variable. This variable represents how confident we are that the satellite has correctly classified a specific image as a wildfire. This distribution will give us some insight into how much we can trust the results of our analysis. Note that we have already restricted our analysis to data points with a confidence level higher than 0.8.
_ = plt.title("Distribution of Confidence")
_ = plt.xlabel("Confidence")
_ = plt.ylabel("Count")
_ = plt.xlim(0.8, 1.0)
_ = plt.hist(df['confidence'].values)
Based on this distribution and our filtering procedure, we can be confident in our analysis that the majority of our data points truly represent wildfires.
Next, we can look at the Fire Radiative Power variable's distribution to get a sense of how strong fires were between 2010 and 2016.
_ = plt.title("Distribution of Fire Radiative Power")
_ = plt.xlabel("Fire Radiative Power")
_ = plt.ylabel("Count")
_ = plt.hist(df['frp'].values)
This plot is not very helpful. We can tell that the overwhelming majority of our data points come from fires with low FRP values, but there is clearly some data that is not easily visible on this plot. Perhaps if we log-scale the y-axis it will be more informative.
_ = plt.title("Distribution of Fire Radiative Power")
_ = plt.xlabel("Fire Radiative Power")
_ = plt.ylabel("log(Count)")
_ = plt.hist(df['frp'].values, log=True)
We can create these same histograms for the vpd, temp, and humidity variables:
_ = plt.title("Distribution of Vapor Pressure Deficit")
_ = plt.xlabel("Vapor Pressure Deficit")
_ = plt.ylabel("Count")
_ = plt.hist(df['vpd'].values)
_ = plt.title("Distribution of Temperature")
_ = plt.xlabel("Temperature (degrees Kelvin)")
_ = plt.ylabel("Count")
_ = plt.hist(df['temp'].values)
_ = plt.title("Distribution of humidity")
_ = plt.xlabel("Humidity (percent)")
_ = plt.ylabel("Count")
_ = plt.hist(df['humidity'].values)
Next, we can look at the distribution over time using the year variable. To create this histogram we will want to bin by year rather than allowing matplotlib to choose the bins automatically, so we can use pandas groupby to get the counts per year explicitly:
year_df = df.groupby(['year']).size().reset_index(name='count')
year_df.head()
Note that now that we have the explicit counts, we can use matplotlib's bar function rather than hist.
_ = plt.title("Distribution of Year")
_ = plt.xlabel("Year")
_ = plt.ylabel("Count")
_ = plt.bar(year_df['year'].values, year_df['count'].values, 0.65)
Next, we can look at the distribution over the year using the dayofyear variable.
_ = plt.title("Distribution of Day of Year")
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Count")
_ = plt.hist(df['dayofyear'].values)
This plot shows that our distribution is fairly symmetric and unimodal, centered around day 225 of 365. But maybe the distribution will be clearer if we view this data by month rather than day.
datetime can be used to format our month numbers into month names by creating a date object with our month number and then specifying an output date string format that only consists of the month name.
def month_num_to_name(month_num):
    date = datetime.date(2018, month_num, 1)
    return date.strftime('%B')
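For example (assuming the default locale, which yields English month names):

```python
import datetime

def month_num_to_name(month_num):
    # The year and day are arbitrary; only the month name is formatted.
    return datetime.date(2018, month_num, 1).strftime('%B')

print(month_num_to_name(1))   # January
print(month_num_to_name(12))  # December
```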
We will obtain counts of fires per month using pandas groupby.
month_df = df.groupby(['month']).size().reset_index(name='count')
month_df['month'] = month_df['month'].apply(month_num_to_name)
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Distribution of Date")
_ = plt.xlabel("Month of Year")
_ = plt.ylabel("Count")
bar_width = 0.65
_ = plt.bar(month_df['month'].values, month_df['count'].values, bar_width)
Finally we can look at the entire span of time from 2010 to 2016 and count the fires observed each month.
year_month_df = df.groupby(['year', 'month']).size().reset_index(name='count')
year_month_df['year-month'] = year_month_df.apply(lambda row: ("%s %i" % (month_num_to_name(row['month']), row['year'])), axis='columns')
year_month_df.head()
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Distribution of Year and Month")
_ = plt.xlabel("Year and Month")
_ = plt.ylabel("Count")
_ = plt.bar(year_month_df['year-month'].values, year_month_df['count'].values, 0.65)
_ = plt.xticks(year_month_df['year-month'].values[::3], rotation=70)
Because this wildfire data also contains location data, we can view it on a map as well. We will begin our multivariate analysis by looking at the location data in relationship to other variables.
To do so, we will load a Python package to assist with geographical data visualization:
- cartopy: geographical visualization toolkit for matplotlib
import cartopy.crs as ccrs
#from cartopy.io.img_tiles import OSM
from cartopy.io.img_tiles import Stamen
imagery = Stamen(style='terrain-background')
To determine the extent (ranges) of our maps, we can determine the range of the latitude and longitude variables in the wildfire dataset.
min_lon = df['lon'].min()
max_lon = df['lon'].max()
min_lon -= (max_lon - min_lon) * 0.2
min_lat = df['lat'].min()
max_lat = df['lat'].max()
For our visualization, we will plot fires from different years using different colors. To do so, matplotlib's colormap module can be used to map each year to a color of the rainbow.
We will make this convenient for ourselves later by creating a function that generalizes this mapping to any sorted array.
def get_cmap_dict(vals, cmap_name='rainbow', numeric=True):
    cmap = cm.get_cmap(cmap_name)
    if numeric:
        # assume vals already sorted
        min_val = vals[0]
        max_val = vals[-1]
        return dict(zip(vals, [cmap((val - min_val) / (max_val - min_val)) for val in vals]))
    else:
        len_vals = len(vals)
        return dict(zip(vals, [cmap(val / len_vals) for val in range(len_vals)]))
years = list(df['year'].unique())
year_to_color_map = get_cmap_dict(years)
Because the plot could become cluttered and difficult to interpret, we will first restrict the fires to those with the top strength (frp) for each year.
n_per_year = 30
df_top_frp = df.sort_values(['frp'], ascending=False)
df_top_frp_year_groupby = df_top_frp.groupby(['year'])
Next we will need to determine the minimum and maximum strength values, which will help us to make a legend.
min_frp = df_top_frp_year_groupby.min()['frp'].min()
max_frp = df_top_frp_year_groupby.max()['frp'].max()
frp_intervals = np.linspace(max_frp, min_frp, 4, endpoint=False)
Now we will create the plot.
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(1, 1, 1, projection=imagery.crs)
_ = ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
_ = ax.add_image(imagery, 8)
frp_scaling_factor = 0.003
for year, df_top_frp_year in df_top_frp_year_groupby:
    for index, row in df_top_frp_year.head(n_per_year).iterrows():
        _ = plt.plot(
            row['lon'],
            row['lat'],
            marker='o',
            color=year_to_color_map[row['year']],
            markersize=(frp_scaling_factor*row['frp']),
            alpha=0.5,
            transform=ccrs.Geodetic()
        )
_ = ax.set_title("West Coast Wildfires 2010-2016, Strongest %i per Year" % n_per_year)
year_legend_elements = [Patch(facecolor=year_color, label=year) for year, year_color in year_to_color_map.items()]
year_legend = ax.legend(handles=year_legend_elements, loc='upper left', title='Year')
_ = plt.gca().add_artist(year_legend)
frp_legend_elements = [Line2D([0], [0], marker='o', color=(0,0,0,0), label=np.floor(frp_val), markerfacecolor='#C0C0C0', markersize=frp_val*frp_scaling_factor) for frp_val in frp_intervals]
_ = ax.legend(handles=frp_legend_elements, loc='lower left', labelspacing=2, columnspacing=2, title='Fire Radiative Power')
Next we can perform multivariate analysis of the rest of our data and ask ourselves whether we see relationships between any of our variables.
In the following plot we will look at temperature over time.
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", data=df)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Temperature (kelvin)")
Next, we can incorporate year by coloring each point based on year.
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", hue="year", data=df, linewidth=0)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Temperature (kelvin)")
It might be easier to look at the means for each year.
year_dayofyear_df = df.groupby(['year', 'dayofyear']).mean().reset_index()
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Mean Temperature by Day of Year")
_ = sns.scatterplot(x="dayofyear", y="temp", hue="year", data=year_dayofyear_df, linewidth=0)
_ = plt.xlabel("Day of Year")
_ = plt.ylabel("Mean Temperature (kelvin)")
year_month_mean_df = df.groupby(['year', 'month']).mean().reset_index()
month_tick_formatter = matplotlib.ticker.FuncFormatter(lambda x, p: month_num_to_name(int(x)) if x >= 1 and x <= 12 else "")
_ = plt.figure(figsize=(14, 6))
_ = plt.title("Mean Temperature by Month")
ax = sns.lineplot(x="month", y="temp", hue="year", data=year_month_mean_df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Mean Temperature (kelvin)")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
Because we have many more variable pairs to evaluate, we can start to look at correlations.
corr_df = df.corr()
_ = plt.figure(figsize=(11, 9))
_ = sns.heatmap(corr_df, annot=True, vmin=-1, vmax=1)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Temperature")
_ = sns.scatterplot(x="temp", y="vpd", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Vapor Pressure Deficit")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Humidity")
_ = sns.scatterplot(x="humidity", y="vpd", data=df)
_ = plt.xlabel("Humidity")
_ = plt.ylabel("Vapor Pressure Deficit")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Temperature")
ax = sns.scatterplot(x="temp", y="frp", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Humidity by Temperature")
_ = sns.scatterplot(x="temp", y="humidity", data=df)
_ = plt.xlabel("Temperature (kelvin)")
_ = plt.ylabel("Humidity")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Humidity")
_ = sns.scatterplot(x="humidity", y="frp", data=df)
_ = plt.xlabel("Humidity")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Vapor Pressure Deficit")
_ = sns.scatterplot(x="vpd", y="frp", data=df)
_ = plt.xlabel("Vapor Pressure Deficit")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Confidence")
_ = sns.scatterplot(x="confidence", y="frp", data=df)
_ = plt.xlabel("Confidence")
_ = plt.ylabel("Fire Radiative Power")
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Month")
ax = sns.scatterplot(x="month", y="frp", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Fire Radiative Power")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Humidity by Month")
ax = sns.scatterplot(x="month", y="humidity", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Humidity")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Vapor Pressure Deficit by Month")
ax = sns.scatterplot(x="month", y="vpd", data=df)
_ = plt.xlabel("Month")
_ = plt.ylabel("Vapor Pressure Deficit")
_ = ax.xaxis.set_major_formatter(month_tick_formatter)
_ = plt.figure(figsize=(8, 6))
_ = plt.title("Strength by Preceding Drought Length")
_ = sns.scatterplot(x="drought_length", y="frp", data=df)
_ = plt.xlabel("Preceding Drought Length")
_ = plt.ylabel("Fire Radiative Power")